This adapter provides functions that use the Apache Tika library to parse files stored in HDFS in various formats. It is described in the following topics:
To use the built-in functions in your query, you must import the Tika file module as follows:
import module "oxh:tika";
The Tika file module contains the following functions:
For examples, see "Examples of Tika File Adapter Functions."
Parses files stored in HDFS in various formats and extracts the content or metadata from them.
declare %tika:collection function tika:collection($uris as xs:string*) as document-node()* external;

declare %tika:collection function tika:collection($uris as xs:string*, $contentType as xs:string?) as document-node()* external;
$uris
: The HDFS file URIs.
$contentType
: Specifies the media type of the content to parse; it may include the charset attribute. When this parameter is specified, it defines both the content type and the encoding. When it is not specified, Tika attempts to detect both from the file extension. Oracle recommends that you specify this parameter.
Returns a document node for each file. See "Tika Parser Output Format".
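For example, a query along the following lines (a sketch; the file pattern is illustrative) parses a set of PDF files and returns the extracted XHTML body of each:

import module "oxh:tika";

(: Parse PDF files, stating the content type explicitly as recommended :)
for $doc in tika:collection("/data/docs/*.pdf", "application/pdf")
return $doc//*:body[1]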
Parses the data given to it as an argument. For example, it can parse an HTML fragment embedded within an XML or JSON document.
declare function tika:parse($data as xs:string?, $contentType as xs:string?) as document-node()* external;
$data
: The value to be parsed.
$contentType
: Specifies the media type of the content to parse; it may include the charset attribute. When this parameter is specified, it defines both the content type and the encoding. When it is not specified, Tika attempts to detect both from the file extension. Oracle recommends that you specify this parameter.
Returns a document node for each value. See "Tika Parser Output Format".
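For example, the following sketch (the fragment itself is illustrative) parses an HTML fragment held in a variable and returns its text content:

import module "oxh:tika";

let $fragment := "<p>Hello <b>Tika</b></p>"
(: Specify both the media type and the charset, as recommended above :)
for $doc in tika:parse($fragment, "text/html;charset=UTF-8")
return string($doc//*:body[1])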
You can use the following annotations to define functions to parse files in HDFS with Tika. These annotations provide additional functionality that is not available using the built-in functions.
Custom functions for reading HDFS files must have one of the following signatures:
declare %tika:collection [additional annotations] function local:myFunctionName($uris as xs:string*, $contentType as xs:string?) as document-node()* external;

declare %tika:collection [additional annotations] function local:myFunctionName($uris as xs:string*) as document-node()* external;
Identifies an external function to be implemented by the Tika file adapter. Required.
The optional method parameter can be one of the following values:
tika
: Each file is parsed and returned as a document-node(). Default.
Declares the file content type. The value is a MIME type and, per the XQuery specification, must not include the charset attribute. Optional.
Declares the file character set. Optional.
Note:
The %output:media-type and %output:encoding annotations specify the content type or encoding when the $contentType parameter is not explicitly provided in the signature.

$uris
: Lists the HDFS file URIs. Required.

$contentType
: The file content type. It may have the charset attribute.

Returns document-node()* with two root elements. See "Tika Parser Output Format".
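For example, the following hypothetical declaration (local:pdfFiles is an invented name) is a sketch of how these annotations can fix the content type and encoding when no $contentType parameter is declared; the function is then called like the built-in one:

import module "oxh:tika";

(: Hypothetical custom function: the annotations supply the content
   type and encoding because no $contentType parameter is declared :)
declare
   %tika:collection
   %output:media-type("application/pdf")
   %output:encoding("UTF-8")
function local:pdfFiles($uris as xs:string*) as document-node()* external;

for $doc in local:pdfFiles("/data/reports/*.pdf")
return $doc//*:body[1]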
The result of Tika parsing is a document node with two root elements:
Root element #1 is the XHTML content produced by Tika.
Root element #2 is the document metadata extracted by Tika.
The root elements have the following formats:
<html xmlns="http://www.w3.org/1999/xhtml"> ...textual content of Tika HTML... </html>
<tika:metadata xmlns:tika="oxh:tika">
   <tika:property name="NAME_1">VALUE_1</tika:property>
   <tika:property name="NAME_2">VALUE_2</tika:property>
</tika:metadata>
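For example, a query along these lines (a sketch; the file pattern and property name are illustrative) selects one metadata property from each parsed file, matching the elements by local name:

import module "oxh:tika";

(: The second root element of each result holds the extracted metadata :)
for $doc in tika:collection("/data/docs/*.pdf")
return $doc/*:metadata/*:property[@name eq "Content-Type"]/string()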
The following Hadoop properties control the behavior of the Tika adapter:
Type: Boolean
Default Value: false
Description: When this property is set to TRUE, all HTML elements are omitted during parsing. When it is set to FALSE, only the safe elements are omitted during parsing.
Type: Comma-separated list of strings
Default Value: Not defined
Description: Defines the locale to be used by some Tika parsers, such as the Microsoft Office document parser. At most three strings are allowed: language, country, and variant; country and variant are optional. When the locale is not defined, the system locale is used. When defined, the strings must follow the java.util.Locale specification format described at http://docs.oracle.com/javase/7/docs/api/java/util/Locale.html, and the locale is constructed as follows:

If only language is specified, then the locale is constructed from the language.

If language and country are specified, then the locale is constructed from both language and country.

If language, country, and variant are specified, then the locale is constructed from language, country, and variant.

For example, the value fr,CA constructs a locale with language fr (French) and country CA (Canada).
This example query uses Tika to parse PDF files whose names match *bigdata*.pdf into HTML form, and then adds the HTML documents to the Solr full-text index.
The following query indexes the HDFS files:
import module "oxh:tika"; import module "oxh:solr"; for $doc in tika:collection("*bigdata*.pdf") let $docid := data($doc//*:meta[@name eq "resourceName"]/@content)[1] let $body := $doc//*:body[1] return solr:put( <doc> <field name="id">{ $docid }</field> <field name="text">{ string($body) }</field> <field name="content">{ serialize($doc/*:html) }</field> </doc> )
The HTML representation of the documents is added to the Solr index, making them searchable. Each document ID in the index is the file name.
This example query uses Tika to parse sequence files in which each key is a URL and each value is an HTML document.
import module "oxh:tika"; import module "oxh:solr"; import module "oxh:seq"; for $doc in seq:collection-tika(”/path/to/seq/files/*") let $docid := document-uri($doc) let $body := $doc//*:body[1] return solr:put( <doc> <field name="id">{ $docid }</field> <field name="text">{ string($body) }</field> <field name="content">{ serialize($doc/*:html) }</field> </doc> )
The HTML representation of the documents is added to the Solr index, making them searchable. Each document ID in the index is the key URL.